{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear regression without scikit-learn\n",
    "\n",
    "In this notebook, we introduce linear regression. Before presenting the\n",
    "available scikit-learn classes, here we provide some insights with a simple\n",
    "example. We use a dataset that contains measurements taken on penguins."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n",
    "penguins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We aim to solve the following problem: using the flipper length of a penguin,\n",
    "we would like to infer its mass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "feature_name = \"Flipper Length (mm)\"\n",
    "target_name = \"Body Mass (g)\"\n",
    "data, target = penguins[[feature_name]], penguins[target_name]\n",
    "\n",
    "ax = sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "ax.set_title(\"Body Mass as a function of the Flipper Length\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p class=\"last\">The function <tt class=\"docutils literal\">scatterplot</tt> from seaborn take as input the full dataframe and\n",
    "the parameter <tt class=\"docutils literal\">x</tt> and <tt class=\"docutils literal\">y</tt> allows to specify the name of the columns to be\n",
    "plotted. Note that this function returns a matplotlib axis (named <tt class=\"docutils literal\">ax</tt> in the\n",
    "example above) that can be further used to add elements on the same matplotlib\n",
    "axis (such as a title).</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "In this problem, penguin mass is our target. It is a continuous variable that\n",
    "roughly varies between 2700 g and 6300 g. Thus, this is a regression problem\n",
    "(in contrast to classification). We also see that there is almost a linear\n",
    "relationship between the body mass of the penguin and its flipper length. The\n",
    "longer the flipper, the heavier the penguin.\n",
    "\n",
    "Thus, we could come up with a simple formula, where given a flipper length we\n",
    "could compute the body mass of a penguin using a linear relationship of the\n",
    "form `y = a * x + b` where `a` and `b` are the 2 parameters of our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def linear_model_flipper_mass(\n",
    "    flipper_length, weight_flipper_length, intercept_body_mass\n",
    "):\n",
    "    \"\"\"Linear model of the form y = a * x + b\"\"\"\n",
    "    body_mass = weight_flipper_length * flipper_length + intercept_body_mass\n",
    "    return body_mass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the model we defined above, we can check the body mass values predicted\n",
    "for a range of flipper lengths. We set `weight_flipper_length` and\n",
    "`intercept_body_mass` to arbitrary values of 45 and -5000, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "weight_flipper_length = 45\n",
    "intercept_body_mass = -5000\n",
    "\n",
    "flipper_length_range = np.linspace(data.min(), data.max(), num=300)\n",
    "predicted_body_mass = linear_model_flipper_mass(\n",
    "    flipper_length_range, weight_flipper_length, intercept_body_mass\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now plot all samples and the linear model prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label = \"{0:.2f} (g / mm) * flipper length + {1:.2f} (g)\"\n",
    "\n",
    "ax = sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "ax.plot(flipper_length_range, predicted_body_mass)\n",
    "_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variable `weight_flipper_length` is a weight applied to the feature\n",
    "`flipper_length` in order to make the inference. When this coefficient is\n",
    "positive, it means that penguins with longer flipper lengths have larger\n",
    "body masses. If the coefficient is negative, it means that penguins with\n",
    "shorter flipper lengths have larger body masses. Graphically, this coefficient\n",
    "is represented by the slope of the curve in the plot. Below we show what the\n",
    "curve would look like when the `weight_flipper_length` coefficient is\n",
    "negative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weight_flipper_length = -40\n",
    "intercept_body_mass = 13000\n",
    "\n",
    "predicted_body_mass = linear_model_flipper_mass(\n",
    "    flipper_length_range, weight_flipper_length, intercept_body_mass\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now plot all samples and the linear model prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "ax.plot(flipper_length_range, predicted_body_mass)\n",
    "_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our case, this coefficient has a meaningful unit: g/mm. For instance, a\n",
    "coefficient of 40 g/mm, means that for each additional millimeter in flipper\n",
    "length, the body weight predicted increases by 40 g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "body_mass_180 = linear_model_flipper_mass(\n",
    "    flipper_length=180, weight_flipper_length=40, intercept_body_mass=0\n",
    ")\n",
    "body_mass_181 = linear_model_flipper_mass(\n",
    "    flipper_length=181, weight_flipper_length=40, intercept_body_mass=0\n",
    ")\n",
    "\n",
    "print(\n",
    "    \"The body mass for a flipper length of 180 mm \"\n",
    "    f\"is {body_mass_180} g and {body_mass_181} g \"\n",
    "    \"for a flipper length of 181 mm\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also see that we have a parameter `intercept_body_mass` in our model.\n",
    "This parameter corresponds to the value on the y-axis if `flipper_length=0`\n",
    "(which in our case is only a mathematical consideration, as in our data, the\n",
    " value of `flipper_length` only goes from 170mm to 230mm). This y-value when\n",
    "x=0 is called the y-intercept. If `intercept_body_mass` is 0, the curve passes\n",
    "through the origin:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weight_flipper_length = 25\n",
    "intercept_body_mass = 0\n",
    "\n",
    "# redefined the flipper length to start at 0 to plot the intercept value\n",
    "flipper_length_range = np.linspace(0, data.max(), num=300)\n",
    "predicted_body_mass = linear_model_flipper_mass(\n",
    "    flipper_length_range, weight_flipper_length, intercept_body_mass\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "ax.plot(flipper_length_range, predicted_body_mass)\n",
    "_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Otherwise, it passes through the `intercept_body_mass` value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weight_flipper_length = 45\n",
    "intercept_body_mass = -5000\n",
    "\n",
    "predicted_body_mass = linear_model_flipper_mass(\n",
    "    flipper_length_range, weight_flipper_length, intercept_body_mass\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "ax.plot(flipper_length_range, predicted_body_mass)\n",
    "_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " In this notebook, we have seen the parametrization of a linear regression\n",
    " model and more precisely meaning of the terms weights and intercepts."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}